Cleanup regex compiler fixed quantifiers source #10843
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10843 +/- ##
================================================
+ Coverage 86.30% 86.32% +0.02%
================================================
Files 144 144
Lines 22665 22668 +3
================================================
+ Hits 19560 19569 +9
+ Misses 3105 3099 -6
Approve with some comments.
Whoops sorry I completed this review but I guess I forgot to actually submit it.
cpp/src/strings/regex/regcomp.cpp
// get left-side (n) value => min_count
exprp += transform_until(exprp, exprp + max_read, buffer.data(), "},");
auto count = std::atoi(buffer.data());
if ((*exprp != '}' && *exprp != ',') || (count > max_value)) {
Since you're only reading at most 3 characters, it is impossible to find `count > max_value`, right?
That is kind of a weak link. If `max_value` changes to a smaller 3-digit value, the check would need to be re-added. This way, this line should never need to change.
But would there ever be a reason to use a number that isn't the largest number representable by `max_read` digits? My understanding of the code was that the limitation was solely in place to control the width of the buffer reads.
Yes, I'd like to consider limiting `max_value` to something like 255 in the future, which would not change `max_read` but would require the `count` check.
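The interaction between `max_read` and `max_value` discussed above can be sketched outside of C++. The snippet below is a minimal Python illustration, not the `regcomp.cpp` implementation: `parse_count` and the constants `MAX_READ` and `MAX_VALUE` are hypothetical names chosen to mirror the quoted lines. It reads at most `max_read` digits, requires a `}` or `,` terminator, and then applies the value check, which only becomes meaningful once `max_value` drops below the largest `max_read`-digit number.

```python
MAX_READ = 3     # assumed: maximum number of digits read for one count
MAX_VALUE = 999  # assumed: largest count currently accepted


def parse_count(expr, pos, max_read=MAX_READ, max_value=MAX_VALUE):
    """Read up to max_read ASCII digits from expr starting at pos.

    Returns (count, index_of_terminator). Raises ValueError when the digits
    are not followed by '}' or ',' or when the count exceeds max_value.
    """
    end = pos
    while end < len(expr) and end - pos < max_read and expr[end] in "0123456789":
        end += 1
    digits = expr[pos:end]
    count = int(digits) if digits else 0  # empty field reads as 0, like atoi

    # A 4th digit (or any other character) shows up here as a bad terminator.
    if end >= len(expr) or expr[end] not in "},":
        raise ValueError("count is malformed or has too many digits")
    # Redundant while max_value == 999, required if max_value becomes 255.
    if count > max_value:
        raise ValueError("count exceeds max_value")
    return count, end
```

With the defaults, `parse_count("1000}", 0)` is rejected by the terminator check alone. With `max_value=255`, `parse_count("300}", 0, max_value=255)` passes the terminator check and is rejected only by the explicit `count > max_value` comparison.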
That's fine with me. Just for my edification, what is the benefit of that choice? It doesn't have any performance implications unless someone actually requests a number that large at runtime, right?
I am asking whether there is a difference between 255 and 999 if a user never actually requests a number > 255. Based on

> The number does contribute to the size of the working memory so may affect runtime performance.

it sounds like the answer is yes? You allocate memory based on the maximum number of repetitions somewhere? In that case, I assume that the amount of memory increases stepwise as the maximum crosses powers of two?
Nothing that sophisticated. Just maybe store the value in a smaller variable (e.g. uint8).
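As a rough illustration of why 255 is the natural limit for an 8-bit unsigned field, the check below uses Python's `int.to_bytes` as a stand-in for a `uint8` variable. The helper name `fits_in_uint8` is hypothetical and nothing here is taken from the libcudf source.

```python
def fits_in_uint8(count):
    """Return True if count can be stored in one unsigned byte (0..255)."""
    try:
        count.to_bytes(1, "little")  # unsigned by default; raises on overflow
        return True
    except OverflowError:
        return False
```

Under this sketch, 255 fits and 999 does not, so a 999 limit would need a wider type.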
If it doesn’t affect runtime performance, why arbitrarily limit at 999 and not the max value for the size of the repetitions variable? There’s some awkwardness in explaining this arbitrary limit. If it were INT_MAX, I think the docs might not even need to mention it. After all, string columns also have a length limit, right? It might be impossible to reach a repeat limit of INT_MAX.
It is just an arbitrary string of digits entered by a human that is being converted to an integer, so some error checking will need to be done, since the string of decimal digits could be any length. Is there some reason not to limit it?
Discussed offline with @davidwendt. I don't think it's worth blocking on this point so I'm fine with accepting a limit of 999. We can change it later if users need more.
@davidwendt I apologize sincerely -- I just noticed that I also started a PR review that I did not complete and submit. Here are a small number of comments from that partial review.
It looks like the review from @vyasr is much more complete, so I will excuse myself from additional review unless you need another look at anything.
| Greedy quantifier | `{n,m}` where `n` and `m` are integers: `0 ≤ n ≤ 999` and `n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Greedy, so repeating `m` times is tried before reducing the repetition to `n` times. | `a{2,4}` matches `aaaa`, `aaa` or `aa` |
| Greedy quantifier | `{n,}` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item at least `n` times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only `n` times. | `a{2,}` matches `aaaaa` in `aaaaa` |
| Lazy quantifier | `{n,m}?` where `n` and `m` are integers `0 ≤ n ≤ 999` and `n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Lazy, so repeating `n` times is tried before increasing the repetition to `m` times. | `a{2,4}?` matches `aa`, `aaa` or `aaaa` |
| Lazy quantifier | `{n,}?` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item `n` or more times. Lazy, so the engine first matches the previous item `n` times, before trying permutations with ever increasing matches of the preceding item. | `a{2,}?` matches `aa` in `aaaaa` |
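The greedy and lazy rows above can be checked quickly with Python's `re` module, which is used here only as a stand-in engine and may differ from libcudf in other respects. `first_match` is a hypothetical helper that returns the text of the first match.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Greedy forms take as many repetitions as allowed; lazy forms take as few.
greedy_range = first_match(r"a{2,4}", "aaaaa")
lazy_range = first_match(r"a{2,4}?", "aaaaa")
greedy_open = first_match(r"a{2,}", "aaaaa")
lazy_open = first_match(r"a{2,}?", "aaaaa")
```

When nothing follows the quantified item in the pattern, the lazy forms stop at the minimum count `n`, and the greedy forms stop at `m` or at the end of the run.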
General regex question: if this is lazy, how does its behavior differ from matching exactly `n` repetitions? What would force it to match more repetitions?
Honestly, I don't know. I think it depends on the previous character pattern. Here is an example with '.' as the repeat item:
>>> re.search('.{2,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{3,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{2,}?b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3,}?b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
Maybe there are better examples.
I think Bradley is asking how `{n}` is different from `{n,}?`, not how `{n,}` is different from `{n,}?`. Here are the extra two cases that need to be added to your examples:
>>> re.search('.{2}b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3}b', 'aabcdefb')
<re.Match object; span=(4, 8), match='defb'>
The differences have to do with backtracking behavior and whether matching the entire regex requires that the lazy quantifier accept more characters. For example:
>>> re.search('a+b{2}a+', 'aaaabbbaaa')
>>> re.search('a+b{2,}?a+', 'aaaabbbaaa')
<re.Match object; span=(0, 10), match='aaaabbbaaa'>
In this case, an exact requirement of `b{2}` won't match, because there are three. But the lazy quantifier says "OK, in that case I'll take some extra b characters and see if I can get it to match".
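The same comparison can be run across several inputs to show when the lazy quantifier is forced past its minimum. This is a sketch using Python's `re`, as in the examples in this thread, and `compare` is a hypothetical helper, not part of any library.

```python
import re


def compare(num_b):
    """Compare exact b{2} with lazy b{2,}? on 'aaaa' + 'b' * num_b + 'aaa'.

    Returns (exact_matched, lazy_match_text).
    """
    text = "aaaa" + "b" * num_b + "aaa"
    exact = re.search(r"a+b{2}a+", text)
    lazy = re.search(r"a+b{2,}?a+", text)
    return exact is not None, lazy.group(0) if lazy else None
```

With two `b` characters both patterns match the whole string. With three or four, only the lazy form matches, because the trailing `a+` cannot succeed until every `b` has been consumed.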
Once we resolve the question of the curly braces with >999 this gets a green light from me.
If the |
LGTM!
@gpucibot merge
Cleans up the `regcomp.cpp` source to fix class names, comments, and simplify logic around processing operators and operands returned by the parser. Several class member variables used for state are moved or eliminated. Some member functions and variables are renamed. Cleanup of the parser logic will be in a follow-on PR.

Reference #3582
Follow on to #10843

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)

URL: #10879
Cleans up the source for handling fixed quantifiers `{n,m}`, which repeat a pattern over a range of counts instead of just zero, one, or infinitely many times. Hopefully this will make this part of the regex parser/compiler easier to follow and maintain. There are many other items to clean up (reference #3582); this change concentrates mainly on the fixed quantifier handling. No function or behavior has changed, but new gtests have been added to cover quantifier combinations that were not previously tested.